Linear regression Lab

Andrea Sánchez-Tapia , Paulinha Lemos-Costa , Sara Mortara
2024-02-01

Simple linear models

Linear regressions in R assume a linear relationship between one or more predictor variables \(X_i\) and a response variable \(y\).

\[y_i \sim Normal(\mu_i, \sigma) \\ \mu_i = \alpha + \beta x_i\]

The first element of this model describes how the response variable \(Y\) can be modeled as a normal random variable with mean and standard deviation.

The second element of this model adds the effect of the predictor variable \(X\) on \(Y\), based on the causal relationship \(X \rightarrow Y\). In this simple case, the relationship is linear and can be described by the equation of a line with intercept \(\alpha\) and slope \(\beta\).

Generating the dataset

In the introductory lecture for linear models, we used an example dataset to fit a simple regression model.

The dataset itself was generated according to the specifications of the model above. We are going to recreate the dataset and some of the calculations presented during the class.

set.seed(4) #ensures the result is reproducible. The random number generation can vary between Operative systems so maybe there will be different datasets in the classroom
N = 30
x <- runif(N, 0, 5)
n <- length(x)
mu <- 1.2 + 3.5 * x
y <- rnorm(N, mean = mu, sd = 3)
lm_data <- data.frame(x,  mu, y, res = y - mu)

Create a plot for the dataset

plot(y ~ x, pch = 19)

As you can see, plotting the data suggests a linear relationship between these two variables

Regression coefficients

We can calculate the correlation coefficients using several methods. The Ordinary Least Squares (OLS) method is the most common, and provides analitical solutions for the coefficients. The slope is the ratio of the covariance between \(x\) and \(y\) and the variance of \(x\). The intercept is the mean of \(y\) minus the product of the slope and the mean of \(x\).

\[\widehat{\beta}= \frac{Cov(x, y)}{Var(y)} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\]

and

\[ \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} \]

beta_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x)) ^ 2)
beta_hat
[1] 3.133357
# This is just the ratio of the covariance between x and y and the variance of x
cov(x, y) / var(x)
[1] 3.133357
# Since mu = alpha + beta*x: 
alpha_hat <- mean(y) - beta_hat * mean(x)
alpha_hat
[1] 3.014409

Add a regression line with the values of the coefficients you calculated

plot(y ~ x, pch = 19)
abline(a = alpha_hat, b = beta_hat, col = 2, lwd = 2)

The sample is not the population

When we generated the dataset (here), \(\mu\) was specified as:

mu <- 1.2 + 3.5 * x_i

This means \(\alpha = 1.2\) and \(\beta = 3.5\). How do these values compare to the coefficients you just calculated?

Add a population line to the plot you previously created by completing the code below:

plot(y ~ x, pch = 19)
abline(a = alpha_hat, b = beta_hat, col = 2)
abline(a = ..., b = ..., col = "blue")

See the code by clicking Show code:

Show code
plot(y ~ x, pch = 19)
abline(a = alpha_hat, b = beta_hat, col = 2, lwd = 2)
abline(a = 1.2, b = 3.5, col = "blue", lwd = 2)

Fitting a linear model in R

Function lm() fits regression models using Ordinary Least Squares. The values of the coefficients can be found using function coef().

mod <- lm(y ~ x)
coef(mod)
(Intercept)           x 
   3.014409    3.133357 

Compare this result with the calculations you did previously.

Model diagnostics

Before making inferences with your fitted model, it is also a good practice to make some model diagnostics. For linear regression, the most basic diagnostic is to check if the residuals follows a normal distribution with zero mean and a constant standard deviation.

Because the normal distribution has a parameter that corresponds to the expected value, and that is independent of the other parameter, the linear regression model we have been writing as:

\[ \begin{align} y_i & \sim Normal(\mu_i, \sigma) \\ \mu_i & = \sum_{j=0}^{J} \beta_{j} x_{ij} \\ \end{align} \]

Can be also written as:

\[ \begin{align} y_i & = \sum \beta_j x_ij + \epsilon \\ \epsilon & \sim Normal(\mu, \sigma) \\ \end{align} \]

That is, the values of the response \(y_i\) are each a weighted sum of the predictor variables \(x_j\) plus a random Gaussian residual \(\epsilon\). To express the random variation symmetrically around the expected value, this residual has a mean value of zero, and a fixed standard deviation.

To check if this assumption is true, we plot the residuals of the fitted model as a function of the predicted values. This plot should show a point cloud of constant width around zero in the Y-axis. You can check this assumption applying the function plot to the object that stored the fitted model:

plot(mod, which = 1)

You can also check the normality of the residuals with a qq-plot. Normal data should lie in a straight line in this plot:

plot(mod, which = 2)

What is your overall evaluation of the normality of the residuals?

You can find an excellent explanation of these and other diagnostic plots for linear regression here